Skip to content

feat(app): migrate app and ante telemetry to OpenTelemetry Meter API#3396

Open
amir-deris wants to merge 18 commits into
mainfrom
amir/plt-323-migrate-app-metrics-to-otel-meter-api
Open

feat(app): migrate app and ante telemetry to OpenTelemetry Meter API#3396
amir-deris wants to merge 18 commits into
mainfrom
amir/plt-323-migrate-app-metrics-to-otel-meter-api

Conversation

@amir-deris
Copy link
Copy Markdown
Contributor

@amir-deris amir-deris commented May 5, 2026

Summary

Migrates telemetry in the app and app/ante packages from the legacy telemetry/utilmetrics helpers to the standardized OpenTelemetry Meter API, following the same pattern established in #3265 for evmrpc.

  • Adds app/metrics.go with a struct-based OTel instrument set registered via otel.Meter("app"), initialized once in NewApp after SetupOtelMetricsProvider()
  • Adds app/ante/metrics.go with a lazily-initialized anteMetrics struct via sync.OnceValue (fires on first CheckTx, after the global MeterProvider is set)
  • Replaces telemetry.MeasureSince / utilmetrics calls in abci.go and invariance.go with dual-emitted OTel instruments; legacy calls remain with TODO(PLT-327) markers until dashboards are migrated
  • Threads ctx/ctx.Context() through all .Record() and .Add() call sites to support OTel's context-based propagation
  • Replaces the per-BeginBlock GaugeSeidVersionAndCommit call with an app_build_info observable gauge that fires on Prometheus scrape — no per-block overhead

New metrics (OTel naming convention, exported via the process-wide MeterProvider)

ABCI phase durations — histograms bucketed at SLO thresholds (p50 ≤ 500ms, p95 ≤ 1.5s, p99 ≤ 2.5s):

  • app_abci_begin_block_duration_seconds
  • app_abci_end_block_duration_seconds
  • app_abci_module_end_block_duration_seconds
  • app_abci_check_tx_duration_seconds
  • app_abci_deliver_tx_duration_seconds
  • app_abci_deliver_batch_tx_duration_seconds
  • app_block_process_duration_seconds — block tx processing duration by execution type

Transaction counters and gas:

  • app_tx_count_total
  • app_tx_process_type_total — by execution type label
  • app_tx_gas_total — cumulative gas by type label
  • app_tx_gas_used / app_tx_gas_wanted — per-tx gauges

App-level flow counters:

  • app_optimistic_processing_total — cache hit (enabled=true) vs miss
  • app_failed_total_gas_wanted_check_total
  • app_giga_fallback_to_v2_total

Light invariance:

  • app_lightinvariance_supply_duration_seconds
  • app_lightinvariance_supply_invalid_key_total
  • app_lightinvariance_supply_unmarshal_failure_total

Build info:

  • app_build_info — observable gauge, always 1, labels: seid_version, commit

Ante:

  • app_pending_nonce_total — pending nonce events by type (added, expired, rejected, accepted)

Migration note

Legacy metrics (telemetry.MeasureSince, utilmetrics.MeasureDeliverTxDuration, etc.) are dual-emitted during the migration window so existing dashboards continue to receive data. Each legacy call site is annotated TODO(PLT-327) and will be removed once dashboards are verified against the new OTel series.


Note

Medium Risk
Touches hot ABCI/tx execution paths to add new OpenTelemetry metric recording and thread contexts through invariance checks; functional behavior should be unchanged but any mistakes could impact performance or panic on misconfigured meter providers.

Overview
Adds new OpenTelemetry Meter-based instrumentation for the app and app/ante packages via app/metrics.go and app/ante/metrics.go, with one-time initialization in New().

Updates ABCI handlers (BeginBlock, EndBlock, CheckTx, DeliverTx, DeliverTxBatch, Commit), block processing paths (sync/OCC/giga), and light invariance checks to record new OTel histograms/counters (durations, tx counts, gas totals, optimistic processing, gas-wanted rejections, giga fallbacks, invariance failures) while temporarily dual-emitting legacy telemetry/utilmetrics metrics behind TODO(PLT-327).

Replaces per-block build info gauge emission with an app_build_info observable gauge and adjusts LightInvarianceChecks to accept a context.Context so OTel metrics can use the correct context.

Reviewed by Cursor Bugbot for commit f2384dd. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedMay 12, 2026, 6:09 PM

@amir-deris amir-deris changed the title Added otel metrics for app and app/ante feat(app): migrate app and ante telemetry to OpenTelemetry Meter API May 6, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 6, 2026

Codecov Report

❌ Patch coverage is 67.20000% with 82 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.37%. Comparing base (26636ba) to head (f2384dd).

Files with missing lines Patch % Lines
app/app.go 45.90% 32 Missing and 1 partial ⚠️
app/invariance.go 30.76% 18 Missing ⚠️
app/metrics.go 89.62% 9 Missing and 2 partials ⚠️
app/abci.go 75.00% 10 Missing ⚠️
app/ante/evm_checktx.go 0.00% 8 Missing ⚠️
app/ante/metrics.go 77.77% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3396      +/-   ##
==========================================
+ Coverage   59.34%   59.37%   +0.02%     
==========================================
  Files        2112     2114       +2     
  Lines      174772   174919     +147     
==========================================
+ Hits       103724   103859     +135     
- Misses      62010    62019       +9     
- Partials     9038     9041       +3     
Flag Coverage Δ
sei-chain-pr 50.80% <67.20%> (?)
sei-db 70.41% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
app/ante/metrics.go 77.77% <77.77%> (ø)
app/ante/evm_checktx.go 34.72% <0.00%> (-0.66%) ⬇️
app/abci.go 63.19% <75.00%> (+0.69%) ⬆️
app/metrics.go 89.62% <89.62%> (ø)
app/invariance.go 55.81% <30.76%> (+14.77%) ⬆️
app/app.go 68.15% <45.90%> (-1.04%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@bdchatham bdchatham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migration shape is right and dual-emit is sound. Three blockers worth fixing before merge — all silent (no runtime error, just wrong data):

  1. Counter and histogram names will double-suffix under the Prometheus exporter (_total_total, _seconds_seconds).
  2. app_tx_count_total double-counts every tx via the unlabeled + labeled Add pair.
  3. txGasUsed / txGasWanted as Int64Gauge is last-write-wins under OCC concurrency.

Plus should-fixes around bucket density at the p99=2.5s SLO threshold, the lazy sync.OnceValue init in app/ante/, and the raw proposer label encoding. Inline below.

Dashboard rewrites land alongside the PLT-327 removal step; existing legacy refs in clusters/prod/monitoring/grafana-dashboards-protocol.yaml are summary-typed ({quantile="0.5"}), so query rewrites are histogram_quantile-based, not 1:1. No alert/recording rules touch these names — the dual-emit window is sufficient buffer.

Comment thread app/metrics.go Outdated
Comment thread app/metrics.go Outdated
Comment thread app/metrics.go Outdated
Comment thread app/abci.go Outdated
Comment thread app/ante/metrics.go Outdated
Comment thread app/app.go Outdated
Comment thread app/ante/metrics.go Outdated
// InitAnteMetrics registers all OTel instruments for the ante package.
// Safe to call concurrently; instruments are registered exactly once.
func InitAnteMetrics() {
appAnteMetrics.mu.Lock()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use sync.Once instead?

@amir-deris amir-deris requested a review from masih May 7, 2026 23:12
Comment thread app/metrics.go
Comment on lines +24 to 26
var millisecondBuckets = metric.WithExplicitBucketBoundaries(
0.000025, 0.000050, 0.0001, 0.0005, 0.001, 0.0025, 0.005, 0.010, 0.020, 0.050, 0.075, 0.1,
)
Copy link
Copy Markdown
Contributor

@bdchatham bdchatham May 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd add 0.25, 0.5, 1.0commitDuration is on this set, and pebble compaction stalls during heavy write load can push commit past 100ms (verified by walking through BaseApp.Commitcms.Commit(true) — the foreground write blocks during L0→L1 stalls). 100ms ceiling buckets that exactly when you most want tail visibility.

Comment thread app/metrics.go
Comment thread app/app.go Outdated
Comment thread app/metrics_test.go
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 7446303. Configure here.

Comment thread app/abci.go Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants